Data Preparationg
Select columns of season
, teamName
, playerName
, position
, pass_accura
, tackle_blocks
, tackle_intercep
, and fouls_committed
in the player dataset. Then, choose players whose position is the defender and playing season from 2020 to 2022. Dropping rows that are pass_accura
, tackle_blocks
, tackle_intercep
, and foul_committed
all are equal to 0. The main reason is these players are transferred by clubs at the beginning of the season or have never player in La Liga. After generating a new data set.
Since the new dataset does not have a label column. Therefore, using k-mean clustering based on numeric variables of pass_accura
, tackle_bolcks
, tackle_intercep
and fouls_committed
. The k-value used here is 3 (Categories: Good performance, normal performance, and bad performance). After K-mean clustering, the label columns are merged into the new dataframe, as shown below.
Train Data and Test Data Splitting
playerTrainDF, playerTestDF = train_test_split(player,test_size = 0.3, random_state=42)
playerTrainDF.to_csv("DTTrainDF.csv")
playerTestDF.to_csv("DTTestDF.csv")
print(playerTrainDF)
print(playerTestDF)
Also using the train_split function as same in the Multinomial NB section, and getting the train dataset and test dataset. Here the test is divided into 30% of the new dataframe. Also, use reandom_state=42 to ensure that the train set and test set do not change once running code every time. Creating a disjoint split is essential for the prediction model. If there are overlapping samples in the training set and test dataset, the model will view the labels in the test set and remember them during training. It can lead to errors in the evaluation of the performance of the model. Just as a student knows the answers to an exam in advance, a teacher cannot determine whether this student has really learned content by the test score.
Using the code below to check whether train set and test set disjoint splitting or not.
disjoint_check = pd.merge(playerTrainDF, playerTestDF, how='inner')
# Check Train and Test sets are disjoint
if not disjoint_check.empty:
print("Train and Test set have same rows")
else:
print("Train and Test set have not same rows")
![]() |
![]() |
---|---|
The Train DatasetTrain Set | The Test DatasetTest Set |
After that, the train set and test will be fine-tuned using the following code in preparation for use with the decision tree algorithm.
playerTestLabel = playerTestDF['performance']
playerTestDF = playerTestDF.drop(['performance'],axis = 1)
playerTrainDF_nolabel = playerTrainDF.drop(['performance'],axis = 1)
playerTrainLabel = playerTrainDF['performance']
dropcols = ['season','teamName','playerName','position']
playerTrainDF_nolabel_quant = playerTrainDF_nolabel.drop(dropcols,axis = 1)
playerTestDF_quant = playerTestDF.drop(dropcols,axis=1)
Resource
Decision Tree Data Preparation Code Python Code